Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Heartbeat] Fix summarizer #36519

Merged
merged 41 commits into from
Sep 19, 2023
Merged

Conversation

andrewvc
Copy link
Contributor

@andrewvc andrewvc commented Sep 6, 2023

WIP fixes for summary duration. The changes in #36147 broke the summaries for browser monitors, causing the monitor.duration.us field to be attached to the journey/end event, and for a final summary event to be emitted that was missing this. monitor.duration.us must be present for all summary events.

The fix here was somewhat complex because prior to this PR much of the summary field generation was separate for browser vs. lightweight monitors. This PR unifies much of this logic into the new summarizer, thus trimming some logic from synthexec.

An additional change is the removal of a number of tests in wrappers. There are a lot of levels of abstraction in the heartbeat processing pipeline, and the wrappers level is tricky in a number of ways, primarily because it only includes some of the fields rendered. Going forward an emphasis on testing in the scenarios package and lower level unit tests (such as the additional assertions added to the summarizer tests) makes more sense in terms of both maintainability and accuracy.

Proposed commit message

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
    - [ ] I have made corresponding changes to the documentation
    - [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
    - [ ] I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Check that browser monitors always emit a monitor.duration.us field with summaries, as do lightweight monitors.

@andrewvc andrewvc self-assigned this Sep 6, 2023
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Sep 6, 2023
@andrewvc andrewvc added bug Heartbeat Team:obs-ds-hosted-services Label for the Observability Hosted Services team v7.11.0 and removed needs_team Indicates that the issue/PR needs a Team:* label labels Sep 6, 2023
@mergify
Copy link
Contributor

mergify bot commented Sep 6, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @andrewvc? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@elasticmachine
Copy link
Collaborator

elasticmachine commented Sep 6, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-09-19T04:43:33.100+0000

  • Duration: 53 min 22 sec

Test stats 🧪

Test Results
Failed 0
Passed 2558
Skipped 28
Total 2586

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@@ -85,7 +85,9 @@ func TestSockTracerWaitFail(t *testing.T) {
started := time.Now()
_, err := NewSockTracer(filepath.Join(os.TempDir(), "garbagenonsegarbagenooonseeense"), waitFor)
require.Error(t, err)
require.GreaterOrEqual(t, time.Now(), started.Add(waitFor))
// Compare unix millis because things get a little weird with nanos
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to the changes here, just noticed this flakey test

@andrewvc andrewvc marked this pull request as ready for review September 8, 2023 20:37
@andrewvc andrewvc requested a review from a team as a code owner September 8, 2023 20:37
Copy link
Member

@vigneshshanmugam vigneshshanmugam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found bugs with retrying logic running with max_attempts: 2.

heartbeat.run_once: true
heartbeat.monitors:
- type: browser
  id: test-hb
  enabled: true
  name: Test HB dev
  schedule: '@every 1m'
  screenshots: "off"
  max_attempts: 2
  source:
    inline:
      script: |-
        step("load homepage", async ) => {

        });
  1. Attempts for failed jobs is calculated incorrectly. Check the image and the appropriate debug message
// even for the first attempt
{"log.level":"debug","@timestamp":"2023-09-13T17:05:47.497-0700","log.origin":{"file.name":"summarizer/plugstatestat.go","file.line":175},"message":"attempt info:  == down && 2 < 2","service.name":"heartbeat","ecs.version":"1.6.0"}
Screenshot 2023-09-13 at 5 00 24 PM 2. The retry delay is wrong, The delay is actually delaying the processing of every single event from the agent instead of just waiting before starting a new continuation.

heartbeat/monitors/wrappers/summarizer/plugerr.go Outdated Show resolved Hide resolved
@@ -46,8 +45,6 @@ func (senr *streamEnricher) enrich(event *beat.Event, se *SynthEvent) error {
type journeyEnricher struct {
journeyComplete bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't think this is an actual issue however. Adding data shouldn't be an issue, modifying or removing it is a different story however.

It wont break anything, but its just unnecessary adding to the storage cost. We were using those on the HB side to measure the step count and such, but not so useful on the UI layer.

}

func (bwdsp *BrowserDurationPlugin) BeforeSummary(event *beat.Event) BeforeSummaryActions {
if bwdsp.startedAt == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

our previous logic which is correct behaviour never sets mon.duration for these cases.

// In case of syntax errors or incorrect runner options, the Synthetics
	// runner would exit immediately with exitCode 1 and we do not set the duration
	// to inform the journey never ran

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing enrich tests makes these cases easier to escape.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ fixed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks the error logger which always expects a duration. Generally I've found that occasionally not having a field leads to more breakage. I've changed the logic however to just set the duration to zero in these cases. If you have any concerns here LMK.

@@ -46,8 +45,6 @@ func (senr *streamEnricher) enrich(event *beat.Event, se *SynthEvent) error {
type journeyEnricher struct {
journeyComplete bool
journey *Journey
errorCount int
error error
stepCount int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the journeyEnricher code is dead code, we can safely remove most of the properties like start, end, stepCount, urlFields, journeyComplete etc as is already covered in summarizer plugins.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great call, will remove

})

// Add step count meta for log wrapper
eventext.SetMeta(event, logger.META_STEP_COUNT, je.stepCount)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this has effect on the Logger where the steps would never be logged. we need this logic to be on summarizer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ bringing back to plugerr.go where we count steps

Copy link
Member

@vigneshshanmugam vigneshshanmugam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except the duration stuff, I am running some small tests before approving the PR around hook errors and other browser checks.

Copy link
Member

@vigneshshanmugam vigneshshanmugam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Except for the case where there are repetition on the error field for the journey/end event.

  1. Synthetics agent generated a after error for a journey
{"type":"journey/end","@timestamp":1695066934081319.8,"journey":{"name":"testing timeout issue - regular","id":"testing timeout issue - regular","status":"failed"},"root_fields":{"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"0.0.0-development"}},"payload":{"start":1300805.898640458,"end":1300806.2263515829,"status":"failed"},"error":{"name":"Error","message":"after Error","stack":"Error: after Error\n    at /Users/vigneshh/elastic/synthetics/synth-init/journeys/debug-timeout.journey.ts:30:11\n    at /Users/vigneshh/elastic/synthetics/src/helpers.ts:117:40\n    at Array.map (<anonymous>)\n    at runParallel (/Users/vigneshh/elastic/synthetics/src/helpers.ts:117:30)\n    at Runner.runAfterHook (/Users/vigneshh/elastic/synthetics/src/core/runner.ts:180:22)\n    at Runner.runJourney (/Users/vigneshh/elastic/synthetics/src/core/runner.ts:369:18)\n    at Runner.run (/Users/vigneshh/elastic/synthetics/src/core/runner.ts:452:11)\n    at Command.<anonymous> (/Users/vigneshh/elastic/synthetics/src/cli.ts:138:23)"},"package_version":"0.0.0-development"}
  1. HB produced duplicate error fields, one inside synthetics.error and error at top level.
{"@timestamp":"2023-09-18T19:41:27.009Z","@metadata":{"beat":"heartbeat","type":"_doc","version":"8.11.0","raw_index":"synthetics-browser-default"},"agent":{"ephemeral_id":"5c814a3d-05f4-4d70-be1a-7eeebce7ff1a","id":"095d58ac-9a94-4709-b918-b07cfcdb4228","name":"Vigneshs-MacBook-Pro.local","type":"heartbeat","version":"8.11.0"},"ecs":{"version":"8.0.0"},"monitor":{"check_group":"5a49b0e6-565b-11ee-9e45-b2772eebffba-1","name":"Improved Stack","id":"improved-stack","timespan":{"lt":"2023-09-18T19:42:27.397Z","gte":"2023-09-18T19:41:27.397Z"},"type":"browser"},"synthetics":{"package_version":"0.0.0-development","index":154,"payload":{"end":1299959.1694570831,"status":"failed","start":1299958.5494264159},"journey":{"name":"testing timeout issue - regular","id":"testing timeout issue - regular"},"error":{"name":"Error","message":"after Error","stack":"Error: after Error\n    at /private/var/folders/_h/tz2qn0gx7gd82wxpft62zhwh0000gn/T/elastic-synthetics-unzip-150742446/journeys/journeys/debug-timeout.journey.ts:31:11\n    at /Users/vigneshh/elastic/synthetics/src/helpers.ts:117:40\n    at Array.map (<anonymous>)\n    at runParallel (/Users/vigneshh/elastic/synthetics/src/helpers.ts:117:30)\n    at Runner.runAfterHook (/Users/vigneshh/elastic/synthetics/src/core/runner.ts:180:22)\n    at Runner.runJourney (/Users/vigneshh/elastic/synthetics/src/core/runner.ts:369:18)\n    at Runner.run (/Users/vigneshh/elastic/synthetics/src/core/runner.ts:452:11)\n    at Command.<anonymous> (/Users/vigneshh/elastic/synthetics/src/cli.ts:138:23)"},"type":"journey/end"},"os":{"platform":"darwin"},"package":{"version":"0.0.0-development","name":"@elastic/synthetics"},"data_stream":{"namespace":"default","dataset":"browser","type":"synthetics"},"observer":{"ip":["fe80::b077:2eff:feeb:ffba","fe80::b077:2eff:feeb:ffb9","fe80::b077:2eff:feeb:ffb8","fe80::f84d:89ff:fe6b:ff07","192.168.0.24","fe80::5c7e:31ff:feed:b57d","fe80::5c7e:31ff:feed:b57d","fe80::6358:d49d:58e3:86be","fe80::d572:c882:417:aa46","fe80::ce81:b1c:bd2c:69e"],"mac":["36-C3-70-61-2B-00","36-C3-70-61-2B-04","36-C3-70-61-2B-08","5E-7E-31-ED-B5-7D","B2-77-2E-EB-FF-98","B2-77-2E-EB-FF-99","B2-77-2E-EB-FF-9A","B2-77-2E-EB-FF-B8","B2-77-2E-EB-FF-B9","B2-77-2E-EB-FF-BA","F8-4D-89-6B-FF-07","FA-4D-89-6B-FF-07"],"hostname":"vigneshs-macbook-pro.local"},"error":{"type":"Error","code":"","message":"error executing step: after Error","stack_trace":"Error: after Error\n    at /private/var/folders/_h/tz2qn0gx7gd82wxpft62zhwh0000gn/T/elastic-synthetics-unzip-150742446/journeys/journeys/debug-timeout.journey.ts:31:11\n    at /Users/vigneshh/elastic/synthetics/src/helpers.ts:117:40\n    at Array.map (<anonymous>)\n    at runParallel (/Users/vigneshh/elastic/synthetics/src/helpers.ts:117:30)\n    at Runner.runAfterHook (/Users/vigneshh/elastic/synthetics/src/core/runner.ts:180:22)\n    at Runner.runJourney (/Users/vigneshh/elastic/synthetics/src/core/runner.ts:369:18)\n    at Runner.run (/Users/vigneshh/elastic/synthetics/src/core/runner.ts:452:11)\n    at Command.<anonymous> (/Users/vigneshh/elastic/synthetics/src/cli.ts:138:23)"},"event":{"type":"journey/end","dataset":"browser"}}
  1. HB summary contains error which is correct behaviour.

Its not a blocker, but good to know why its happening.

journey('after error', ({ page }) => {
  after(() => {
    throw new Error('after Error');
  });
  step('load example', async () => {
    await page.goto('https://example.com/home');
  });
});

@vigneshshanmugam vigneshshanmugam enabled auto-merge (squash) September 18, 2023 20:34
@vigneshshanmugam vigneshshanmugam merged commit ccfa54a into elastic:main Sep 19, 2023
7 checks passed
andrewvc added a commit that referenced this pull request Sep 29, 2023
[Heartbeat] Fix missing monitor.status value in initial attempt where max_attempts > 2. Introduced in #36623 adding tests to the scenario runner as well.

Original cause was this PR #36519 that did not produce the correct monitor.status: down when the monitor is retried with the second attempt.
Scholar-Li pushed a commit to Scholar-Li/beats that referenced this pull request Feb 5, 2024
* Cleanup summarizer code

* Separate concerns in summarizer

* Checkpoint

* Fix failing tests

* FMT

* Tweaks

* Make linter happy

* progress

* cleanup docs

* Bring back wrappers tests (partial)

* Restore wrapper tests

* Fix failing tests

* Fix err handling

* Re-init plugins on retry

* Fix error field handling across retries

* Incorporate PR feedback

* Type fix

* URLs now work, tests passing

* Improved err handling

* Test fixes

* Cleanup naming

* Fix handling of step counts / journey/end missing and also fix continuity test

* Fix failing tests around logging / logging behavior

* Rename OnRetry to BeforeRetry

* Move monitor.status calculation for browsers into summarizer

* Cleanup status logic

* More status consolidation

* Fixed failing tests

* Make monitor logger errors more understandable

* Fix retry delay

* Fix retry delay

* Remove spurious 'wrapped:' in logs

* Incorporate pr feedback

* Fix dur

* Fix cmd status

* Fix tests

* Fmt

* Integrate PR feedback

---------

Co-authored-by: Vignesh Shanmugam <[email protected]>
Scholar-Li pushed a commit to Scholar-Li/beats that referenced this pull request Feb 5, 2024
)

[Heartbeat] Fix missing monitor.status value in initial attempt where max_attempts > 2. Introduced in elastic#36623 adding tests to the scenario runner as well.

Original cause was this PR elastic#36519 that did not produce the correct monitor.status: down when the monitor is retried with the second attempt.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Heartbeat Team:obs-ds-hosted-services Label for the Observability Hosted Services team v8.11.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants